Foreword



CEC's policy on inclusive schools and community settings invites all
educators, other professionals, and family members to work together to
create early intervention, educational, and vocational programs and
experiences that are collegial, inclusive, and responsive to the diversity
of children, youth, and young adults. Policymakers at the highest levels
of state/provincial and local government, as well as school administration,
also must support inclusive principles in the educational reforms
they espouse.

One area in which the inclusion of students with disabilities is
critical is the development and use of new forms of assessment. This is 
especially true when assessment becomes a tool by which local school 
districts, states, and our nation show accountability for the education of 
students. 

As multidimensional instruments that can cross curriculum areas,
performance assessments have the potential to be powerful instructional
tools as well as tools for accountability. As this new technology
is applied in creating new assessment instruments, students with
disabilities must be considered during the design of the assessment,
administration, scoring, and reporting of results.

CEC is proud to contribute this Mini-Library to the literature on
performance assessment, and in so doing to foster the appropriate
inclusion of students with disabilities in this emerging technology for
instruction and accountability.
Preface



Performance assessment, authentic assessment, portfolio assessment — these 
are the watchwords of a new movement in educational testing. Its 
advocates say this movement is taking us beyond the era when the 
number 2 pencil was seen as an instrument of divine revelation. Its critics 
say it is just another educational bandwagon carrying a load of untested 
techniques and unrealistic expectations. 

Despite the criticisms and reservations that are sometimes ex- 
pressed, these new approaches are being implemented in a growing 
number of large-scale assessment programs at federal, state, and district 
levels. They are also finding their way into small-scale use at school and 
classroom levels. 

What about students with disabilities? Are the new assessment 
techniques more valid than conventional assessment techniques for 
these students? Are the techniques reliable and technically sound? Will
they help or hinder the inclusion of students with disabilities in large- 
scale assessment programs? Can classroom teachers use the techniques 
to assess student learning and possibly enrich the classroom curriculum? 

The following fictional vignettes illustrate some of these issues. 

Vignette 1 

The State of Yorksylvania developed educational standards 
and a statewide system of student assessments to monitor 
progress in achieving the standards. The use of standardized 
multiple-choice tests was rejected because these tests were 
thought to trivialize education. It was feared that teachers
would "teach down" to the tests rather than "teach up" to the 
standards. So, committees of teachers, parents, and employ- 
ers were formed to translate the standards into "authentic" 
performance assessments. The resulting assessment system 
was called the Yorksylvania Performance Inventory (YPI). 
Once a year, students from every school in the state were
administered the YPI, which consisted of several assess- 
ments, each of which required up to 3 days to complete. 
Students worked, sometimes individually and sometimes in 
small groups, on tests involving complex, high-level tasks 
that crossed curriculum areas. In one task, students individu- 
ally did research and answered essay questions interrelating 
the geography, wildlife, and history of their state. In another 
task, students worked in groups to design a car powered by 
fermentation. Schools were provided with practice activities 
and curriculum guides to encourage the infusion of performance
assessment activities into the school curriculum.

The state policy allowed special education students to be 
included in the YPI, excluded, or provided with special modifications,
depending on their individual needs as indicated in
their individualized education programs. Initially, most spe- 
cial education teachers supported the YPI because they felt it 
eliminated some artificial barriers (reading, test-taking skills, 
etc.) that put their students at a disadvantage on other types 
of tests. However, there were some questions and issues, such 
as the following: 

• Some of the YPI tasks involved a lot of reading, more than 
was found on previous types of tests. 

• Special education teachers sometimes felt pressured to 
exclude their students from testing in order to increase the 
school's scores. 

• Special education students sometimes experienced ex- 
treme frustration in the YPI assessments, many of which 
bore no resemblance to these students' other schoolwork. 

• Some parents of special education students questioned 
whether the standards were really applicable to their chil- 
dren and whether the YPI was diverting instruction from 
more relevant and important topics. 

Vignette 2 

A teacher named Pat had students at a wide range of func- 
tioning levels, including a number of mainstreamed students 
receiving special education services. Pat was always on the 
lookout for new ideas and approaches. Pat began reading 
articles and attending conferences on new assessment approaches
termed portfolio assessment, authentic assessment, performance
assessment, and alternative assessment. These approaches
seemed to make a lot of sense, and Pat decided to
try them out. One of the first approaches Pat tried was 
authentic assessment. Rather than simply testing students on 
their rote learning of skills and content, Pat began to look for 
ways to use realistic, complex activities to test whether the 
students could actually apply what they learned. For exam- 
ple, Pat combined writing, spelling, science, and career skills 
into an activity in which students wrote letters of application 
for jobs as physicists, biologists, or chemists. Pat particularly 
valued activities that engaged students in solving interesting
problems. For example, after a unit on optics, Pat assigned 
students to draw a diagram explaining why mirrors reverse 
an image from left to right but not from top to bottom. The 
students grappled with that problem for several days. 

Pat liked the holistic scoring procedures developed in 
these new assessment approaches. Rather than simply mark- 
ing a response correct or incorrect, Pat scored student work 
on a number of dimensions (e.g., analysis of the problem, 
clarity of communication) according to meaningful quality 
criteria. The development of authentic performance tasks and 
scoring procedures helped Pat clarify the most important 
learning outcomes. 

Pat also liked the idea of portfolio assessment, in which 
students could select and collect "best pieces" to demonstrate 
their learning and achievement during the year. Student 
self-evaluation became a valued part of this process. 

In all, Pat was very pleased with these new assessment
approaches and intended to continue using them. Instruction 
became more activity based and more focused on real-world 
uses of the material. There were, however, some issues that 
Pat began to think about:

• Students with deficits in certain academic areas, notably 
writing, were at a real disadvantage. It was sometimes 
hard to determine whether an inadequate response re- 
sulted from poor writing skills, poor mastery of the con- 
tent, poor problem-solving skills, lack of creativity, or 
some combination of these factors. Pat considered allow- 
ing some students to tape record their responses, but de- 
cided not to. Wasn't writing itself an authentic task 
required in the real world? 
• Pat wasn't sure how to use the information provided by
these tests to plan additional instruction, particularly if a 
student was having difficulty. 

• Pat wondered how to tell whether or not an activity was 
in fact authentic, especially for students whose adult lives 
would be very different from Pat's own. 

In 1992, the Division of Innovation and Development (DID) in the 
U.S. Department of Education's Office of Special Education Programs 
and the ERIC/OSEP Special Project of The Council for Exceptional 
Children formed a Performance Assessment Working Group to discuss 
issues such as these. The term performance assessment was adopted as a 
general designation for the range of approaches that include performance
assessment, authentic assessment, alternative assessment, and portfolio
assessment.

Performance assessment was defined as having the following
characteristics: 

1. The student is required to create an answer or a product rather than simply
fill in a blank, select a correct answer from a list, or decide whether a 
statement is true or false. 

2. The tasks are intended to be "authentic." The conventional approach
to test development involves selecting items that represent curricu- 
lar areas or theoretical constructs, and that have desired technical 
characteristics (e.g., they correlated with other similar items, they
discriminated between groups, etc.). Authentic tasks, on the other 
hand, are selected because they are "valued in their own right" 1
rather than being "proxies or estimators of actual learning goals." 2

The Performance Assessment Working Group produced this series 
of four Mini-Library books on various topics related to performance 
assessment and students with disabilities. In National and State Perspectives
on Performance Assessment and Students with Disabilities, Martha
Thurlow discusses trends in the use of performance assessment in large- 
scale testing programs. In Performance Assessment and Students with Dis- 
abilities: Usage in Outcomes-Based Accountability Systems, Margaret 
McLaughlin and Sandra Hopfengardner Warren describe the experiences
of state and local school districts in implementing performance
assessment. In Creating Meaningful Performance Assessments: Fundamental
Concepts, Stephen Elliott discusses some of the key technical issues
involved in the use of performance assessment. And, in Connecting
Performance Assessment to Instruction, Lynn Fuchs discusses the classroom
use of performance assessment by teachers.

1 R. L. Linn, E. L. Baker, & S. B. Dunbar. (1991). Complex, performance-based assessment:
Expectations and validation criteria. Educational Researcher, 20(8), 15-21.
2 M. W. Kirst. (1991). Interview on assessment issues with Lorrie Shepard. Educational
Researcher, 20(2), 21-23, 27.

Martha J. Coutinho 
University of Central Florida 

David B. Malouf 
U.S. Office of Special Education Programs 

August, 1994 
PART I.
THE EVOLUTION OF 
PERFORMANCE ASSESSMENT: 
WHERE DO WE STAND? 

Testing and the assessment of children's academic progress have become 
focal issues of educational reform activities in the 1990s. Leading educa- 
tors and many consumers want assessment methods to cover content 
that represents "important" educational outcomes, to challenge students 
to use higher-order thinking skills and apply their knowledge, and to 
inform teaching (Stiggins, 1991; Wolf, LeMahieu, & Eresh, 1992). In short,
it seems that many educational stakeholders are calling for assessment 
to drive instruction and to measure a range of achievement outcomes 
from ready-to-work skills to high-level reasoning with math and science
concepts. This is asking a lot. However, given the assumption that "what 
you test is what you teach," this emphasis on assessment as a vehicle for 
reform should not be surprising. 

Special educators have not been involved in many aspects of recent 
educational reform efforts, but they have had much to contribute to 
assessment practices and instruction. Perhaps one reason for the omission
of special educators and students with disabilities from the assessment
reform discussions is that statewide, on-demand types of
assessments historically have not included or accommodated many 
students with disabilities (Ysseldyke & Thurlow, 1993a). 

The literature on performance assessment and the rationale for its 
increased use reflects strong parallels to curriculum-based measurement
(Deno, 1985; Shinn, 1989) and behavioral assessment (Kratochwill & 
Sheridan, 1990) methods. Fuchs (1994), in an accompanying book in this 
series, provides an excellent comparative analysis of these approaches 
to performance assessment. Prior to Fuchs's work, there had not been
any direct references acknowledging these similarities or the theoretical 
and technical knowledge base for these alternative assessment methods. 




Similarly, little has been written in the special education literature about 
performance (authentic) assessment until very recently (e.g., Poteet, 
Choate, & Stewart, 1993).

Whatever their educational background, educators share a com- 
mon desire for more assessments that are relatively low-inference, ob- 
jective measures of important learning outcomes that lead to 
instructional actions. Since performance assessments have been touted 
as offering these features, they deserve serious attention from educators 
of all students. 

Performance assessment recently has become one of the most 
written about alternative methods to norm-referenced tests. Yet, there 
are few empirical investigations into the efficacy of performance assess- 
ments and no published reports about using performance assessment 
methods to evaluate the academic functioning of students with disabili- 
ties or those who are educationally at risk. This lack of data has done 
little to slow endorsements of performance assessment, however, for as 
Madaus (1985) observed several years ago, "testing is the darling of 
policymakers across the country" (p. 5). 

Although policymakers may find testing a "darling," educators 
and psychometricians involved in the actual development and use of 
performance tests and related assessment procedures are discovering 
some significant technical and practical challenges. This book examines 
fundamental technical and implementation issues involved with large- 
scale, on-demand performance assessments and teacher-constructed, 
classroom-based performance assessments. The purposes and conse- 
quences of these two types of assessments are often very different, and 
they require examination of a wide range of issues. To gain an under- 
standing of the potential advantages and disadvantages of performance 
assessment, it is necessary to discuss its definitions and core concepts, 
examine sources of validity evidence, and analyze steps in the development
and interpretation of an assessment task.
1. Definitions and Core Concepts



Performance assessment is defined as "testing methods that require 
students to create an answer or product that demonstrates their knowledge
or skills" (U.S. Congress, Office of Technology Assessment [OTA],
1992, p. 17). It can take many forms, including conducting experiments, 
writing extended essays, or doing mathematical computations. Performance
assessment is best understood as a continuum of assessment formats
ranging from the simplest student-constructed responses to
comprehensive demonstrations or collections of work over time. Whatever
the format, common features of performance assessments involve
(a) students' construction rather than selection of a response; (b) direct 
observation of student behavior on tasks resembling those commonly 
required for functioning in the world outside school; and (c) illumination 
of students' learning and thinking processes along with their answers 
(OTA, 1992). 

Performance and Authentic Assessment 

As Coutinho and Malouf (1992) noted regarding performance assess- 
ment with students with disabilities, writers have used a variety of terms 
(e.g., authentic, portfolio, alternative) to refer to assessment methods fea- 
turing student-generated responses. The term performance emphasizes a 
student's active generation of a response and highlights the fact that the
response is observable either directly or indirectly via a permanent 
product. The term authentic refers to the nature of the task and context in 
which an assessment occurs. The authenticity dimension of assessment 
has become an important issue for at least two reasons. First, most 
educators assume that the more realistic or authentic a task is, the more 
interesting it is to students. Thus, students' motivation to engage in and 
perform an "authentic" task is perceived to be much higher than it is for 
tasks that do not appear to be relevant "real-world" problems or issues. 
Second, for educators espousing an outcomes-oriented approach to 
education, it is important to focus assessments on complex sets of skills 
and conditions that are generalizable across disciplines.
Key Dimensions of Performance Assessment

The term performance is consistently used by authors discussing state- 
wide, on-demand assessments for which students must produce a de- 
tailed response, whereas the term authentic is used more often by 
educators to describe teacher-constructed or -managed classroom as- 
sessment tasks that students must perform. Any serious discussion of 
educational assessment must consider the key dimensions implicit to 
both terms. 

Figure 1 highlights the performance and authenticity dimensions 
of educational assessment tasks. It also indicates that a common third 
dimension of a valid assessment task is that the content assessed repre- 
sents the content taught. Figure 1 synthesizes three key dimensions that 
educators want to manipulate in their assessments of students' achieve- 
ment: student response, nature of the task, and relevance to instruction. 

As indicated in the figure, assessment tasks can be characterized as 
varying in the degree to which they are performance in nature, authentic, 
and aligned with curriculum outcomes. For example, a low-performance 
task might be filling in a bubble sheet or selecting the best answer by
circling a letter, whereas a high-performance task might be writing and 
presenting a report of research or conducting a scientific experiment in 
a lab. Similarly, a low-authenticity task might be reading isolated nonsense
words or writing a list of teacher-generated spelling words,
whereas a high-authenticity task might be reading a newspaper article
or the directions for installing a phone recording system or writing a 
letter to a friend using words that are important to the student. Finally, 
an example of a task that has a low degree of alignment with curriculum 
outcomes is one in which facts and concepts are taught, but application is
assessed; one with a high degree of alignment teaches and assesses the 
application of facts and concepts. Many educators are searching for 
assessments that are relatively high on all three dimensions. That is, they 
want highly authentic or "real-world" tasks that clearly are connected to 
their instructional curriculum and require students to produce, rather 
than select, a response. Conceptually, such tasks would lie within the 
HIGH circle in Figure 1. 
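
The three dimensions just described can be pictured as a simple rating profile. The following Python sketch is only an illustration: the 0-to-1 numeric scale, the 0.7 threshold, the class and function names, and the example ratings are all assumptions introduced here, since the text speaks only of "low" and "high" positions on each dimension.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssessmentTask:
    """A task rated on the three dimensions discussed in the text.

    Ratings run from 0.0 (low) to 1.0 (high). The numeric scale is an
    assumption for illustration; the text only distinguishes low from high.
    """
    name: str
    performance: float   # degree to which the student constructs a response
    authenticity: float  # degree to which the task resembles real-world work
    alignment: float     # degree to which assessed content matches taught content

    def __post_init__(self):
        # Reject ratings outside the assumed 0.0-1.0 scale.
        for label in ("performance", "authenticity", "alignment"):
            value = getattr(self, label)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{label} must be between 0.0 and 1.0, got {value}")


def is_high_on_all(task: AssessmentTask, threshold: float = 0.7) -> bool:
    """Return True when every dimension meets the (hypothetical) threshold."""
    return min(task.performance, task.authenticity, task.alignment) >= threshold


# Hypothetical ratings for two examples mentioned in the text.
bubble_sheet = AssessmentTask("fill in a bubble sheet", 0.1, 0.2, 0.8)
lab_experiment = AssessmentTask("conduct a lab experiment", 0.9, 0.8, 0.9)

print(is_high_on_all(bubble_sheet))    # False
print(is_high_on_all(lab_experiment))  # True
```

Taking the minimum across the three ratings reflects the idea that a task belongs in the HIGH region only when it is strong on every dimension at once, not when a high rating on one dimension offsets a low rating on another.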

FIGURE 1

The Relationship Among Performance, Authenticity, and the
Classroom Curriculum in an Assessment Task

Performance assessment is not entirely new to many educators. For
example, physical education, art, music, and vocational and technological
arts teachers all use students' products or performances to determine
whether or not learning objectives have been met. What is new is (a) the
use of this form of assessment in the core curricular areas of math,
science, language arts, and social studies; (b) the use of scoring criteria
to influence and interpret performances; and (c) the encouragement of
students to conduct self-assessments. Thus, many educators already use
some "weak" forms of performance assessment. That is, they (a) ask
students to apply their knowledge and skills by producing a product and
(b) provide students feedback about their performances in the form of
grades. Besides these two traditional elements of performance assessment,
the new, pedagogically stronger forms of performance assessment
take steps to influence students' performances by:

1. Selecting assessment tasks that are clearly aligned or connected to 
what has been taught.

2. Sharing the scoring criteria for the assessment task with students
prior to working on the task. 

3. Providing students clear statements of standards and/or several
models of acceptable performances before they attempt a task. 

4. Encouraging students to complete self-assessments of their performances.

5. Interpreting students' performances by comparing them to stand- 
ards that are developmentally appropriate, as well as to other 
students' performances. 
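
As a rough illustration of the steps listed above, the Python sketch below represents scoring criteria that could be shared with students in advance and then used to compare a teacher's ratings with a student's self-assessment. The four-level scale, its labels, and the function names are hypothetical; the two dimension names are borrowed from the Preface vignette, and nothing here is a prescribed scoring scheme.

```python
# Hypothetical four-level scale; the text does not prescribe one.
LEVELS = {1: "beginning", 2: "developing", 3: "proficient", 4: "advanced"}

# Dimension names borrowed from the Preface vignette; others could be added.
DIMENSIONS = ("analysis of the problem", "clarity of communication")


def describe(ratings: dict) -> dict:
    """Translate numeric ratings into level labels, checking every dimension."""
    described = {}
    for dimension in DIMENSIONS:
        if dimension not in ratings:
            raise KeyError(f"missing rating for: {dimension}")
        level = ratings[dimension]
        if level not in LEVELS:
            raise ValueError(f"unknown level {level} for: {dimension}")
        described[dimension] = LEVELS[level]
    return described


def disagreements(teacher: dict, student: dict) -> list:
    """List the dimensions on which teacher rating and self-assessment differ."""
    return [d for d in DIMENSIONS if teacher[d] != student[d]]


teacher_ratings = {"analysis of the problem": 3, "clarity of communication": 2}
self_ratings = {"analysis of the problem": 3, "clarity of communication": 4}

print(describe(teacher_ratings))
print(disagreements(teacher_ratings, self_ratings))
```

Listing the dimensions where the two sets of ratings differ gives the teacher and the student a concrete starting point for discussing the standards, which is the communicative use of assessment that the surrounding text emphasizes.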



Performance assessment is not new to educators. What is 
new is its use in core curriculum areas, the use of scoring 
criteria, and the encouragement of students to conduct 
self-assessments. 



As conceptualized here, the stronger forms of performance assess- 
ment interact with instruction that precedes and follows an assessment 
task. This approach to assessment emphasizes the point that the central 
purposes of most educational assessments are to facilitate communication among 
educational stakeholders — teachers, students, parents, administrators, employ- 
ers — and to guide instruction. 






Reactivity and Consequences of Assessments 

Many advocates of performance assessment (e.g., Archbald & New- 
mann, 1988; Wiggins, 1993) hope that authentic/performance assess- 
ments will be reactive in guiding or influencing instruction. That is, they 
believe that if teachers use assessments requiring students to produce 
something that is valued in the real world, something that is meaningful, 
then teachers will be more likely to adjust their curriculum to focus on 
real-world outcomes that are highly valued. Many state-level perform- 
ance assessment projects currently under way seem to be counting on 
the fact that their assessment instruments (and scoring criteria) will be 
reactive. The issue of reactivity in assessment is not new, at least not to 
those familiar with behavioral assessment and the behavior change 
research (Kazdin, 1974). In most assessment situations, assessors work
hard to reduce or eliminate reactivity effects on the person being assessed.
In the case of performance assessment, however, advocates are
hoping that teachers and parents will react — that is, they will change 
their expectations about assessment outcomes and scoring criteria, 
which in turn will lead to improved student learning. 

Reactivity of any performance assessment probably will be greatly 
influenced by the consequences (or stakes) of the assessment. If the con- 
sequences are significant (or high-stakes), change in instruction and 
content is likely to follow; if the consequences are insignificant (or 
low-stakes), change in instruction and content is not likely to occur. 
Collectively, the issues of reactivity and consequences of assessment lead 
to the technical issue of validity, in particular consequential validity 
(Messick, 1989). These issues are discussed in greater detail, along with 
other issues of reliability and validity, in a later chapter.

From a classroom perspective, performance assessments would 
appear to allow more flexibility in the administration of tasks and offer 
an increased number of pathways for a learner to demonstrate command 
of the knowledge and skills required to accomplish a task. From an 
empirical perspective, evidence concerning the functioning of students 
with disabilities on performance assessments — whether the tasks are 
part of an on-demand large-scale assessment program or at the class- 
room level — is scarce. Given that at least 38 states in the United States 
presently are involved in the use or development of some form of 
statewide, on-demand performance assessment instruments (OTA, 
1992), it seems clear that information about the use of performance 
assessments with students with disabilities is needed. 
2. Theoretical Matters



Surprisingly little has been written about the theoretical aspects of 
performance assessment. Kratochwill (1992) noted the absence of a clear 
linkage to a theoretical base in the performance assessment literature and 
noted several advantages a linkage could offer developers and users of 
such performance assessment instruments. The advantages included (a) 
a conceptual framework to guide development, (b) information pertain- 
ing to empirical support for the inclusion of various assessment tech- 
nologies, (c) guidelines for evaluation and refinement of the assessments, 
and (d) a framework for using the accumulation of knowledge within 
education and psychology about learners and learning. 

Cognitive psychology and behavioral assessment are two fields 
with the potential to contribute to a theoretical base. Some authors have 
suggested that performance assessment is being driven by cognitive 
psychology, especially within the context of drawing distinctions be- 
tween content (e.g., facts, concepts) and process knowledge (e.g., proce- 
dures, applications) (Archbald, 1991). Cognitive psychologists have 
provided much knowledge about influencing students' use of problem- 
solving strategies and the application of analytical skills, abilities that are 
of great interest to educators (e.g., Gardner, 1986; Resnick & Resnick, 
1992). However, whether cognitive psychology can contribute to the 
measurement challenges in this area is highly debatable, as this approach 
is seen by many as having its most useful application in research and 
theory development rather than in real educational measurement prob- 
lems (Mehrens, 1992). 

Behavioral Assessment 

Behavioral assessment can be defined as "the identification of meaning- 
ful response units and their controlling variables for the purposes of 
understanding and altering behavior" (Hayes, Nelson, & Jarrett, 1986, p. 
464). Kratochwill noted that behavioral assessment is based on various 
theoretical models of behaviorism, including applied behavior analysis, 
neobehavioristic mediational S-R models, cognitive behavior modifica- 
tion, and social learning theory. A major characteristic of each of the 
approaches is that various environmental and situational influences are
examined for their effect on behavior. Not only are these variables said
to influence behavior during the assessment process, but their analysis 
and manipulation are linked to the development of effective instruc- 
tional programs as well. This concern about the situational and environ- 
mental influences in an assessment parallels the motivations of 
developers of performance assessments, who want to design assess- 
ments that are authentic and challenging to learners and contain 
prompts, cues, and scoring criteria that are pedagogically sound.






In an earlier article (Elliott, 1991), I characterized performance 
assessment as a neobehavioral approach to educational assessment be- 
cause of (a) its heavy emphasis on the use of direct observations and 
permanent products in evaluating a person's behavior and (b) its con- 
cern for the authenticity of the task or performance situation. Given that 
most performance assessments are interpreted from an idiographic or
criterion-referenced perspective, they are conceptually aligned with a
behavior assessment model rather than a traditional norm-referenced
psychometric model. 

There is considerable knowledge to be gained from observing 
behavioral assessment in practice and the underlying theoretical assumptions.
But perhaps more important, the empirical knowledge base
that has been derived from work on the behavioral assessment of chil- 
dren's academic and social behavior is extensive, and it could provide 
direction to those using performance assessments (e.g., Kratochwill &
Shapiro, 1988). 



3. Current Research on 
Performance Assessment 



The current research literature on performance assessment does not 
address critical issues in the assessment of students with disabilities.
There is a developing literature, however, on performance assessments
with minority students, some of whom may be considered to be at risk 
educationally. Reviews of literature in military performance assess- 
ments by Baker, O'Neil, Jr., and Linn (1993) and in education by Baker 
(1990) have reported that less than 5% of the literature is based on 
empirical data. Accounts of the reliability of scoring procedures have 
dominated the data-based studies.

Individual Differences 

Several teams of researchers have been interested in how students of 
different experiences, abilities, and ethnic backgrounds do on perform- 
ance assessment tasks. Shavelson and his colleagues conducted a series 
of studies concerning performance assessment in elementary science 
(Shavelson & Baxter, 1992; Shavelson, Baxter, & Pine, 1992; Shavelson et
al., 1991). The researchers concluded that hands-on performance assessments
in science can be produced that are reliable and capable of distinguishing
students who have experienced hands-on science education
from those who have been educated largely via textbook instruction. 

The researchers also noted that their science performance assessments
correlated only moderately with the Comprehensive Test of Basic
Skills (CTBS), suggesting that the hands-on assessments are measuring 
a somewhat different achievement construct than that assessed on a 
standardized multiple-choice test. How students with disabilities would 
perform on these hands-on science tasks is unknown at this time; how- 
ever, many of these hands-on tasks are intentionally less well defined 
than the traditional multiple-choice problems provided students (e.g., 
extraneous information may be included, procedures may not be ex- 
tracted from answer stems, etc.). Thus, organizational skills and appli- 
cation of general problem-solving skills are required for good 



performance on many of the performance assessment tasks being used 
in statewide assessment programs. Coupled with the fact that the tasks 
are timed, it is likely that a significant percentage of students with
learning disabilities would find these tasks both very challenging and 
frustrating. Accommodations and adaptations in presentation and re- 
sponse formats, time, and setting, as dictated by students' individual 
needs, would be necessary to avoid bias in testing this student popula- 
tion. Accommodations currently used in state testing programs are 
discussed in National and State Perspectives on Performance Assessment and 
Students with Disabilities by Martha L. Thurlow, in this mini-library. 

With regard to the effects of background factors such as race or 
ethnicity, the evidence is mixed as to whether performance assessments 
increase or decrease bias (Cizek, 1991; Linn, Baker, & Dunbar, 1991). The 
National Assessment of Educational Progress (NAEP) reported that 
differences between African Americans and Caucasians were about the 
same on essay tests of writing and multiple-choice tests of reading. In 
Testing in American Schools (OTA, 1992), it was reported that experi- 
ences on the California Bar Exam indicated that minority differences are 
similar or possibly greater on performance tasks. Linn, Baker, and Dun- 
bar (1991) found that African-American, Mexican-American, and Asian- 
American college students did better on direct measures of writing than 
on multiple-choice tests of written English. Using NAEP data, Koretz, 
Lewis, Skewes-Cox, and Burnstein (1992) reported that students differ 
by ethnicity in the rate at which they attempt more open-ended types of 
items. 

Thus, the evidence concerning the use of performance assessment 
tasks with students from different ethnic and cultural backgrounds is 
unclear at this time. Some researchers have reported differences across
cultural groups, whereas others have not. When differences have oc- 
curred, it has been unclear whether these differences were due to bias in 
content, actual differences among the participants tested, or bias in the 
scoring. 



Task Specificity 

Linn (1993) reported that experience with the use of performance-based 
measures in a variety of contexts indicates that performance on one task 
has only a weak to modest relationship to performance on another, even 
seemingly similar task. Examples of this observation can be found in 
ratings of students' written compositions (e.g., Dunbar, Koretz, & 
Hoover, 1991). Similarly, Shavelson and his colleagues, in the previously 
mentioned series of studies of performance assessment in elementary 
science, found that students' performances tend to vary widely from task 
to task, suggesting that upward of 10 to 20 tasks may be needed to
evaluate science achievement reliably. Linn (1993) and others (Dunbar 
et al., 1991) have concluded that although raters do contribute to some 
of the variance in the characterization of performance results, with 
careful design of scoring criteria and good training, raters can provide 
highly consistent ratings. Thus, it seems the majority of variance in the 
performances of students is contributed by differences in their abilities 
and the difficulty of the tasks tested. 
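
One way to see why weakly related tasks push the required number of tasks so high is the Spearman-Brown formula from classical test theory. The sketch below is an illustration only: the studies cited above relied on generalizability analyses rather than this formula, and the inter-task correlations, the 0.80 reliability target, and the function names are assumptions chosen for the example.

```python
import math


def composite_reliability(single_task_r: float, n_tasks: int) -> float:
    """Spearman-Brown estimate of the reliability of a score based on n_tasks tasks."""
    return n_tasks * single_task_r / (1 + (n_tasks - 1) * single_task_r)


def tasks_needed(single_task_r: float, target: float) -> int:
    """Smallest number of tasks whose composite reliability reaches the target."""
    if not (0 < single_task_r < 1 and 0 < target < 1):
        raise ValueError("both values must lie strictly between 0 and 1")
    exact = target * (1 - single_task_r) / (single_task_r * (1 - target))
    # Round away floating-point noise before taking the ceiling.
    return math.ceil(round(exact, 9))


# Hypothetical inter-task correlations, chosen only for illustration.
for r in (0.20, 0.25, 0.30):
    print(r, tasks_needed(r, target=0.80))
```

With these illustrative values the sketch gives 16, 12, and 10 tasks, which is in the same range as the 10 to 20 tasks suggested by the science studies.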

Scoring 

The research suggests that we have reasonably well-developed models 
of rater training, maintenance of scale reliability, and verification of 
raters' use of predefined scoring rubrics (Baker, O'Neil, & Linn, 1993).
This research has been done in virtually every school subject matter and 
across multiple grade levels. In most cases, the scoring has been done on 
students' written products, although some have scored students' oral 
performances from video (Hawkins, Collins, & Frederiksen, 1990) or via 
multimedia systems (Goldman, Pellegrino, & Bransford, in press). There 
is good evidence from work done both in the United States and abroad 
(Gipps, 1993; Queensland Department of Education, 1991) that scoring 
of large-scale performance assessment is feasible and can be done with 
high interrater reliability. 
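
Interrater reliability of rubric scores is often summarized with simple statistics. The sketch below shows two common ones, percent exact agreement and Cohen's kappa, for a pair of raters. It is not the procedure used in the studies cited above; the scores and function names are hypothetical and serve only to show the arithmetic.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Proportion of responses given the identical rubric score by both raters."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty score lists of equal length")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Observed agreement corrected for the agreement expected by chance."""
    observed = percent_agreement(rater_a, rater_b)
    n = len(rater_a)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[s] * counts_b[s] for s in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical 4-point rubric scores from two raters on ten responses.
scores_a = [4, 3, 3, 2, 4, 1, 3, 2, 4, 3]
scores_b = [4, 3, 2, 2, 4, 1, 3, 2, 4, 3]
print(percent_agreement(scores_a, scores_b), cohens_kappa(scores_a, scores_b))
```

Percent agreement is easy to communicate, but it can look high when most responses fall in one or two score categories; kappa adjusts for that, which is why the two are often reported together.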

Finally, there is some evidence from statewide assessment programs
with middle school and high school students that performance
assessments result in relatively low levels of student performance in
almost every subject matter area (Baker et al., 1993; Webb, 1993). This
trend should be interpreted carefully, and it should raise questions about 
how instructional experiences are related to the assessment format, as 
well as about students' motivation to perform on these new assessment 
instruments. It seems that performance assessment tasks are more diffi- 
cult than those on traditional tests; however, until more research is 
completed, issues of task difficulty remain unanswered. 



It seems safe to conclude that performance assessment . . . 
is being advanced by dogma more than by data. 



Need for a Research Base 

Certainly, more is known about performance assessment than has been 
published in professional journals; however, it seems safe to conclude 
that performance assessment as a method to supplement, or replace, 
traditional multiple-choice tests is being advanced by dogma more than 
data. Practically, it is an appealing approach for many educators,
especially those interested in current reforms concerning outcomes and
standards. However, the effects on students — especially students with 
disabilities and students who may be educationally at risk — are un- 
known. What is known is that more research is needed on performance 
assessment — in particular, research on the consequences of performance 
assessment for students who have disabilities or are at risk educationally 
and for their teachers. 

The present research knowledge base regarding performance assessment
is limited for a variety of reasons, central to which is the issue
of validity and its related technical aspects. Advances in the use of 
performance assessments will be concurrent with the study of their 
validity. This is as it should be, for much of the data about the validity 
of an assessment instrument can only be gathered after the instrument 
has been used. Validity is both a conceptual and a technical issue at the 
center of all assessment activities. Therefore, the remainder of this book 
focuses on validity and several related technical issues that should be 
understood by potential users of performance assessment instruments 
whether they are large-scale (statewide) or teacher-constructed, class- 
room-based assessments. 
PART II
TECHNICAL ISSUES IN 
DEVELOPING AND USING 
PERFORMANCE ASSESSMENT 
INSTRUMENTS 

Implementation of performance assessments and related policy making 
are prompting test developers and psychometricians to rethink funda- 
mental concepts of assessment and examine methods for ensuring high- 
quality assessment instruments. The technical challenges confronting 
test developers, especially with regard to large-scale or statewide per- 
formance assessments, are compounded by issues of time constraints 
and high-stakes use of results. Performance assessments generally are 
time intensive, and this creates practical dilemmas concerning breadth 
and depth of coverage. Performance assessments often have been pro- 
moted by policymakers who want to use their results to elevate stand- 
ards of educational performance for individuals and entire schools. As 
the consequences for performances are increased, so are concerns about
the comparability and corruptibility of performance assessment instruments
(Baker et al., 1993; Linn, 1993; Mislevy, 1992). The dual concerns
of time constraints and assessment consequences frequently associated 
with statewide assessments are also an issue of concern to classroom 
teachers and their students. 
4. Validity and Its Bases



Central to the development and use of any assessment instrument is the 
conceptualization of validity. The validity of an assessment depends on 
the degree to which the interpretations and uses of assessment results 
are supported by empirical evidence and logical analysis. Thus, valida- 
tion of an assessment instrument or process requires an evaluation of 
interpretations of results, as well as the intended and unintended conse- 
quences of using the assessment. In focusing on the consequences of an 
assessment, it becomes apparent that validity issues are in many ways 
issues of values. 

Key Assumptions 

Kane's (1992a, 1992b) work on validating performance assessments 
provides insights into key assumptions underlying the validation proc- 
ess and the relationship between validity and reliability. Specifically, 
Kane noted four key assumptions:

1. The domain of tasks from which the sample is drawn (the target 
domain) is appropriate for the skill being assessed. 

2. Performance on a sample of tasks from the domain has been 
observed and evaluated in an appropriate way. 

3. One can generalize from performance on the sample of tasks to 
expected performance over the domain of tasks. 

4. There are no extraneous factors that have an undue influence on 
the results of the performance test. (1992a, p. 10) 

Evaluative Criteria 

Criteria for evaluating the validity of tests and related assessment instru- 
ments have long existed (e.g., Buros, 1933) and have been written about 
extensively (e.g., Cronbach, 1990; Wiggins, 1993). A joint committee of
the American Educational Research Association, American Psychologi- 
cal Association, and National Council on Measurement in Education 
(1985) developed a comprehensive list of standards for tests that stressed 
the importance of construct validity. Extrapolating from this document,
Baker and her associates (1993, p. 1214) enumerated five internal char- 
acteristics that valid performance assessments should exhibit: 

1. Have meaning for students and teachers and motivate high per- 
formance. 

2. Require the demonstration of complex cognitions . . . applicable to 
important problem areas. 

3. Exemplify current standards of content or subject matter quality. 

4. Minimize the effects of ancillary skills that are irrelevant to [the] 
focus of assessment. 

5. Possess explicit standards for rating or judgment. 

Given the approach to performance assessment that most states 
seem to be taking and the stated assumptions underlying the validation 
process, the most compelling forms of evidence needed to validate a 
performance assessment are generalizability data, interrater/interscorer 
reliability data, and judgments of the importance of the knowledge and 
skills required to successfully complete the tasks. Thus, evidence for the 
validity of a test or assessment instrument takes two forms: (1) how the 
test or assessment instrument "behaves" given the content covered and 
(2) the effects of using the test or assessment instrument. 

Questions commonly asked about a test's "behavior" concern its 
relation to other measures of a similar construct, its ability to predict 
future performances, and its coverage of a content domain. Questions 
about the use of a test typically focus on the test's ability to reliably 
differentiate individuals into groups and to guide the methods teachers
use to teach the subject matter covered by the test. Some questions arise 
about unintended uses of a test or assessment instrument. For example: 
Does use of the instrument result in discriminatory practices against 
various groups of individuals? Is it used to evaluate others (e.g., parents 
or teachers) who are not directly assessed by the test? 

Messick (1988) best captured the complexities of judging an assess- 
ment instrument's validity, characterizing it as "an inductive summary 
of both the adequacy of existing evidence for and the appropriateness of 
potential consequences of test interpretation and use" (p. 34). Thus, 
Messick corrected the common misconception that validity lies within a 
test and went on to conceptualize validity as resting on the following 
four bases:

(1) an inductive summary of convergent and discriminant
evidence that the test scores have a plausible meaning or 
construct interpretation, (2) an appraisal of the value impli- 
cations of the test interpretation, (3) a rationale and evidence 
for the relevance of the construct and the utility of the scores 
in particular applications, and (4) an appraisal of the potential 
social consequences of the proposed use and of the actual 
consequences when used. (1988, p. 42)

Figure 2 provides a graphic representation of Messick's four bases 
of test validity. This figure indicates that test validity can be represented 
in terms of two facets connecting the source of the justification (i.e., 
evidential basis or consequential basis) to the function or outcome of the 
testing (i.e., interpretation or use). According to Messick (1988), this 
crossing of basis and function "provides a unified view of test validity" 
(p. 42). In a more recent article, Messick (1994) elaborated on the inter- 
play between evidence and consequences in the validation of perform- 
ance assessments. He concluded that, like all assessments, performance 
assessments "should be validated in terms of content, substantive, struc- 
tural, external generalizable, and consequential aspects of construct 
validity" (p. 22). Messick went on to advise developers of performance
assessments to use a "construct-driven rather than a task-driven ap- 
proach . . . because the meaning of the construct guides the selection or 
construction of relevant tasks as well as the rational development of 
scoring criteria and rubrics" (p. 22). 



FIGURE 2 
Facets of Test Validity 
Note. From Messick (1988), p. 42.
5. Technical Challenges
Associated with 
Performance Assessments 



Most of the statewide performance assessments being developed in the 
United States are apparently intended to be high stakes. In other words, 
their results would lead to significant consequences both for individuals 
and for schools or school districts. Given this assumption, the technical 
qualities of the instruments and the scoring procedures must meet high 
standards for reliability and validity.

The twin hallmarks of traditional tests, reliability and validity, 
require close examination and extension, because the new models of 
assessment, including performance assessments, are "located conceptu- 
ally midway on the continuum between construct approaches and 
ideographic demonstration of complex performance" (Baker, O'Neil, &
Linn, 1993, p. 1214). Four related clusters of conceptual issues dominate
most discussions about providing evidence for the reliability and
validity of performance assessment instruments. These are: (1) assessment
as a curriculum event, (2) task content alignment with curriculum
and important educational outcomes, (3) scoring of results and subsequent
communications with consumers, and (4) linking and comparing
results over time.

Assessment as a Curriculum Event 

The conceptualization of an assessment as a curriculum event is the 
direct result of work in language arts performance assessment. Many 
language arts educators see externally mandated assessments not only 
as insensitive to the integrity of the language arts instruction but also as 
demanding performances that are at odds with the performances that 
occur naturally in conjunction with integrated language arts instruction 
(Witte & Vander Ark, 1992). To overcome their concerns about externally 
mandated assessments, language arts educators have reconceptualized 
a test as a curriculum event — that is, as a series of theoretically and 
practically coherent learning activities structured in such a way that they
lead to a single predetermined end. 

The conceptualization of assessment as a curriculum event has 
ramifications for the content of an assessment instrument, the length and 
types of activities required to complete the assessment, the number of 
items in the assessment instrument, and the scoring. This conceptualiza- 
tion of assessment as a curriculum event presents some significant 
challenges to a traditional test theory model and subsequent tactics for 
documenting reliability and validity. Conversely, the conceptualization
is consistent with many educators' perspectives of what assessment 
ought to be like, because it incorporates qualities that should make the 
results more meaningful and useful to educators and students alike. 



Seeing assessment as a curriculum event . . . should make
the results more meaningful and useful to educators and 
students alike. 



Regardless of subject matter area, as performance assessments take
on the characteristics of strong performance assessments outlined ear- 
lier, the lines between what is assessment and what is instruction blur. 
Some performance assessment advocates (e.g., Wiggins, 1992) say this is 
as it should be, but for those concerned about documenting the technical 
qualities of an assessment, this blurring adds a significant burden to the 
task. Issues of instructional opportunities and equity also become more 
salient as the lines between assessment and instruction blur (Task Force
Report on Cultural and Linguistic Diversity in Education, 1993). 

Task Content Alignment with Curriculum 

The perception that much of what is tested is not relevant or has not been 
taught in a classroom has been a source of concern to many educators 
and students. Another perception is that how information is tested 
influences students' performances. The concern about test content is 
multifaceted and is of central relevance to judgments about the reliability 
and validity of any assessment instrument. The relationships among
curriculum content, task content, and important educational outcomes 
often are not under the control of one group. For statewide assessments, 
a strategy of using experienced classroom teachers to play the major role 
in the development of test items and materials is necessary to increase 
the likelihood that the content of assessments will be consistent with 
what is or can be taught in the classroom and with what is highly valued 
as an outcome of education. 
Given the limited number of items that can be presented in a
performance assessment as conceptualized in most statewide assess- 
ment programs, issues of domain definition and content sampling from 
the domain become critical. Specifically, there is a need for personnel 
involved in the development of statewide instruments in the content 
areas to provide a definition of their subject domain so that items and 
materials that are representative of various aspects of the domain can be 
sampled yearly. This issue of domain sampling also is highly relevant to 
discussions of task and instrument difficulty, and ultimately to judg- 
ments of comparability of instruments over time. 



In statewide assessments, the use of experienced 
classroom teachers to develop test items and material 
increases the alignment of the assessment with the 
curriculum and with valued educational outcomes. 



Content alignment between what is tested and what is taught 
generally is less of an issue with teacher-developed performance
assessments than with formal statewide assessments. The related issue of what
content should be covered and assessed for all students is a concern at 
the local level, and it is a particularly salient one for educators serving 
students with disabilities. Work done at the National Center on Educa- 
tional Outcomes at the University of Minnesota, funded by the U.S. 
Department of Education, provides educators with a meaningful framework
for conceptualizing valued outcomes for all students and formulating
means for assessing these outcomes (Ysseldyke & Thurlow,
1993b). Performance assessments are part of the recommended assess- 
ment packet. 

Scoring and Subsequent Communications with 
Consumers 

The accurate and meaningful scoring of performance assessments is 
predicated on the development of descriptive scoring criteria, evaluative 
standards that are well-understood and developmentally appropriate, 
and well-trained raters. Typically, a general framework for scoring 
performance tasks features the use of Likert-type scales with multitrait 
anchoring descriptions that raters must use to characterize students' 
responses to each item. The anchor point labels are likely to be consistent
across subject matter. For example, top performances are often charac- 
terized as "Exemplary Performances," whereas extremely poor perform- 
ances, regardless of subject matter, are characterized as "Inadequate 
Performances." Responses that are largely incomplete are characterized
as "Unscorable Performances." Within this general scoring approach,
ordinal category scores are rendered for various knowledge and skill 
area domains in a subject. 

As presently conceptualized, the scoring and interpretation of 
performance assessment instruments is akin to a criterion-referenced 
approach to testing. A student's performance is evaluated by a trained 
rater who compares the student's responses to multitrait descriptions of
performances and gives the student a single number corresponding to 
the description that best characterizes the performance. Students are 
compared directly to scoring criteria and only indirectly to each other. 

Given this general approach to scoring (whereby teachers trained 
as scorers rate performances of students), high (e.g., 90%) interrater
agreements and "blind" ratings (e.g., teachers rate students from schools
other than those in which they teach) are essential features of a statewide 
performance assessment system. This approach to scoring is more labor 
intensive and time consuming than the scoring of traditional multiple-choice
tests, although it has the benefit of educating teachers about the content of the
test and the characteristics of exemplary performances. Time and costs 
for scoring can be very high; however, they can be reduced if a sampling 
approach, rather than a census approach, is taken to scoring students' 
responses. In other words, instead of scoring all students' responses for 
the state's purposes of monitoring school and district performances, a 
representative sample (e.g., 20% to 25%) of the students from each school 
could be scored by raters from outside the school district. The tests for 
the remaining students could be scored locally by teachers if so desired. 
Thus, a local district could get detailed individual reports of perform- 
ances for all of their students, while the state would get performance 
results for groups of students by grade within schools or school districts. 
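
The sampling approach described above can be sketched in a few lines. The example assumes a simple random draw within each school and a 25% sampling fraction; a state program might instead stratify by grade or program to keep the sample representative. The rosters, the seed, and the function name are hypothetical.

```python
import math
import random


def sample_for_external_scoring(students_by_school, fraction=0.25, seed=0):
    """Pick a simple random sample of students from every school."""
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced and audited
    sample = {}
    for school, students in students_by_school.items():
        k = math.ceil(fraction * len(students))  # round up so small schools are covered
        sample[school] = sorted(rng.sample(students, k))
    return sample


# Hypothetical rosters of student identifiers.
rosters = {
    "School A": [f"A{i:02d}" for i in range(40)],
    "School B": [f"B{i:02d}" for i in range(30)],
}
externally_scored = sample_for_external_scoring(rosters)
for school, ids in externally_scored.items():
    print(school, len(ids), "of", len(rosters[school]), "scored outside the district")
```

Responses from students who are not drawn would still be scored locally, so the district keeps individual reports while the state works only from the sampled group results.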



Time and costs for scoring can be reduced if a sampling 
approach, rather than a census approach, is taken. 



When using performance assessment at the classroom level, it is 
unlikely that teachers would use a sampling approach to scoring. Every 
student needs feedback when the purpose of assessment is diagnosis and 
monitoring of student progress. Many advocates of performance assess- 
ment encourage teachers to show students how to assess their own 
performances. This is possible when the scoring criteria are well articu- 
lated and teachers are comfortable with having students share in their 
own evaluation process. Many special educators have used self-monitoring
or self-evaluation interventions for years with some of their
students, so it seems reasonable that they would be receptive to some of
the self-assessment aspects of performance assessment. 

Linking and Comparing Results Over Time

Several of the major outcomes-related questions that have stimulated the 
development of performance assessment instruments by states concern 
comparisons of students over time and across grade levels. Therefore, 
methods are needed to facilitate reliable and meaningful comparisons of 
students' performances. Statistical and judgmental methods have been 
developed to accomplish this goal. Linn (1992) has referred to these as 
linking methods. Linking, according to Linn, is a generic term that 
includes a variety of approaches to making results of one assessment 
comparable to those of another. A variety of other terms (e.g., anchoring,
benchmarking, calibration, equating, prediction, projection, scaling, statistical
moderation, social moderation, verification, and auditing) have been used to 
characterize approaches to comparing results from different assess- 
ments. Tables 1 and 2 provide a description of the most frequently used 
terms and requirements for using the methods. 

For most purposes of performance assessments, the two ap- 
proaches to linking that seem most appropriate and manageable are 
statistical moderation and social moderation. The statistical moderation 
approach is used to compare performances across content areas (e.g., 
math to language arts) for groups of students who have taken a test at
the same point in time. The social moderation approach to linking is a 
judgmental approach that is built on consensus of raters. In the use of 
social moderation, the comparability of scores assigned depends sub- 
stantially on the development of consensus among professionals. This 
process serves to verify samples of performances at successively higher 
levels in a system (e.g., class, school, district, and state) and to function 
as an audit. As noted in Table 2, the social moderation approach substi- 
tutes requirements for developing professional consensus regarding 
standards and exemplars of performances meeting those standards for 
the more familiar measurement and statistical requirements associated 
with statistical moderation. Linking assessments over a period of 1 or 2 
years generally is not a concern of classroom teachers, who use
performance-based assessments to evaluate individual students in each new
class, rather than to directly compare classes over the course of several 
years. 
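
As a rough illustration of statistical moderation, the sketch below rescales one school's locally assigned scores so that their mean and spread match that school's results on a common external (anchor) examination. Linear rescaling is only one possible adjustment and is assumed here for simplicity; it is not presented as Linn's (1992) procedure, and the scores and function name are hypothetical.

```python
from statistics import mean, pstdev


def moderate_scores(local_scores, anchor_scores):
    """Linearly rescale one school's local scores to its anchor-test mean and spread."""
    local_mean, local_sd = mean(local_scores), pstdev(local_scores)
    anchor_mean, anchor_sd = mean(anchor_scores), pstdev(anchor_scores)
    if local_sd == 0:
        # No spread in the local scores: every student receives the anchor mean.
        return [anchor_mean for _ in local_scores]
    scale = anchor_sd / local_sd
    return [anchor_mean + (score - local_mean) * scale for score in local_scores]


# Hypothetical data: teacher-assigned rubric scores and the same school's
# scores on a centrally scored standardized test.
teacher_scores = [4, 5, 5, 6]
anchor_test = [35, 45, 50, 50]
print(moderate_scores(teacher_scores, anchor_test))
```

The adjustment preserves the rank order of students within the school and shifts only the scale, so its usefulness depends on how strongly the anchor test relates to the locally scored work.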
TABLE 1

Description of, Requirements for, and Examples of
Five Forms of Linking Distinct Assessments

Form of Linking: Equating

Description: Strongest form of linking. Any interpretation justified
for one form of test is also justified for equated form.

Requirements: Most demanding form of linking. Forms must measure the
same construct with equal degree of reliability. Forms are
interchangeable.

Example: New versions of a state test used to certify high school
graduates are introduced each year. It is desired that the score
required for graduation is equivalent from one year to the next.

Form of Linking: Calibration (includes vertical equating)

Description: Different techniques of linking generally needed to
support individual and group interpretations. Linking giving the right
answer to most likely score on other form for individual student, in
general, will give the wrong answer to questions about distributions
for groups (e.g., percent of students in state scoring at the advanced
level on NAEP) and vice versa.

Requirements: Must measure the same construct, but may differ in
reliability. May also differ in the level at which the measures are
most useful (e.g., forms designed for students at different grade
levels).

Example: A state uses a version of a test that is shorter than a
national test but designed to measure the same skills. The state
version is less reliable than the national test due to its reduced
length. Estimates of the percentage of students in the state who score
above selected points on the national test are desired.

Form of Linking: Statistical Moderation

Description: Comparisons are made among scores provided by different
sources (e.g., teachers) or different subject matter areas (e.g.,
English, math, history). Statistical moderation is used to adjust
scores in an effort to make them "comparable." Comparability is
imperfect and may give unknown advantage to one locale or content area
relative to another.

Requirements: Some external examination or anchor measure is needed to
adjust local scores or scores on different subject area examinations.
Utility depends heavily on the relevance of the external examination or
anchor test and on the strength of its relationship with the locally
defined or subject area examinations.

Example: An assessment system consists of a combination of extended
response questions that are scored locally by teachers and a
standardized test that is administered under controlled conditions and
scored centrally. The standardized test is used to adjust for
between-school differences in teacher-assigned scores on the
locally scored questions.

Form of Linking: Prediction (also called projection)

Description: The weakest form of statistical linking. The predictions
are heavily dependent on context, group used to establish a
relationship, and time. Predictions that hold for one group (e.g.,
males) may not hold for another (e.g., females) or for a combination of
groups (e.g., males and females).

Requirements: Predictions can be made as long as there is a
relationship between measures. The precision of the prediction will
depend on the strength of the relationship. Due to sensitivity of
prediction to context, group, and time, the prediction needs to be
re-evaluated frequently.

Example: Performance on a multiple-choice test is obtained and used to
predict performance on an essay test. Different prediction systems are
used for predicting performance for individual students and for
predicting group distributional characteristics.

Form of Linking: Social Moderation (also called consensus moderation,
auditing, verification)

Description: Performances on distinct tasks are rated using a common
framework and interpreted in terms of a common standard (e.g., essays
written in response to prompts used in state A and different prompts in
state B are interpreted in terms of the same national standards).

Requirements: The primary requirements are concerned with the
development of a consensus on definitions of standards and on the
performances that meet those standards. Staff development and review of
discrepancies in ratings are critical. Ratings assigned by local
teachers may be compared to independently assigned ratings from other
raters and the latter may be used to adjust local scores. Documentation
needs to be provided regarding the degree to which different sets of
judges agree that given responses to different tasks meet common
standards.

Example: States or groups of states develop their own sets of
performance-based assessments in reference to a common content
framework. Scoring of performance depends heavily on professional
judgments of teachers and a system of spot checks and verification.
Nonetheless, it is expected that performance of individual students,
schools, districts and states will be compared to a single set of
national standards.

Source. Linn (1992)
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Requirements 
Linking 



TABLE 2 
of Different Techniques of 
Distinct Assessments 



Type of Linking 





EQ 1 




STAT 




cnr 


Requirements for Assessments 


CAL 2 


MOD 3 


PRE 4 " 


MOD 


Measure same thing 


yes 


yes 


no 


no 


no 


(construct) 








Equal reliability 


yes 


no 


no 


no 


no 


Equal measurement 


yes 


no 


no 


no 


no 


precision throughout the 










range of levels of student 












achievement 












A common external 


no 


no 


yes 


no 


no 


exam zna tion 










Different conversion to 


no 


maybe 


NA 


yes 


no 


go from test X to Y than 








irom Y to A 












Different conversions for 


no 


yes 


no 


yes 


no 


estimates for individuals 




and for group 












uistriouoonai 












characteristics 








» 




Frequent checks for 


no 


yes 


yes 


yes 


yes 


stability over contexts, 




groups, and time required 












Consensus on standards 


no 


no 


no 


no 


yes 


and on exemplars of 










performance 












Credible, trained judges to 


no 


no 


no 


no 


yes 


make results comparable 











Equating 
Calibration 
Statistical Moderation 
Predication 
Social Moderation 

Source Linn (1992). 
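
The row in Table 2 on "different conversion to go from test X to Y than from Y to X" can be illustrated with a small sketch. The Python below is only an illustration under stated assumptions: the function names are hypothetical, the scores are invented, and simple linear (mean and standard deviation) equating and least-squares regression stand in for the more elaborate equating and prediction procedures a testing program would actually use. It shows that an equating conversion undoes itself when reversed, whereas a prediction from a multiple-choice score to an essay score and back again does not return the starting score.

```python
import statistics


def linear_equating(x_scores, y_scores):
    """Return (slope, intercept) that maps X onto Y by matching means and SDs."""
    mx, my = statistics.mean(x_scores), statistics.mean(y_scores)
    sx, sy = statistics.pstdev(x_scores), statistics.pstdev(y_scores)
    slope = sy / sx
    return slope, my - slope * mx


def linear_prediction(x_scores, y_scores):
    """Return (slope, intercept) of the least-squares regression of Y on X."""
    mx, my = statistics.mean(x_scores), statistics.mean(y_scores)
    sxy = sum((x - mx) * (y - my) for x, y in zip(x_scores, y_scores))
    sxx = sum((x - mx) ** 2 for x in x_scores)
    slope = sxy / sxx
    return slope, my - slope * mx


def round_trip(method, x_scores, y_scores, x):
    """Convert x from X to Y, then convert the result back from Y to X."""
    a, b = method(x_scores, y_scores)   # X -> Y conversion
    c, d = method(y_scores, x_scores)   # Y -> X conversion
    return c * (a * x + b) + d


if __name__ == "__main__":
    # Invented scores for five students (assumption, for illustration only).
    multiple_choice = [10, 12, 15, 18, 20]
    essay = [2, 3, 3, 5, 4]

    # Equating is symmetric: the round trip returns the starting score
    # (up to floating-point rounding).
    print(round_trip(linear_equating, multiple_choice, essay, 20))

    # Prediction is not symmetric: the round trip is pulled toward the mean.
    print(round_trip(linear_prediction, multiple_choice, essay, 20))
```

In this sketch the equating round trip recovers the original score, while the prediction round trip moves it toward the group mean whenever the two measures are less than perfectly correlated. That asymmetry is why Table 2 lists "yes" for prediction and "no" for equating in that row.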






6. Information Needed to Advance 
the Use of Performance 
Assessments 



Educators have covered significant conceptual ground in the develop- 
ment and use of performance assessment instruments with students in 
the mainstream. Little, however, has been done to ensure or at least 
understand how students with disabilities are likely to be affected by 
performance assessments. Many issues, both technical and practical, 
remain. The fundamental issue of inclusion of students with disabilities 
in state and national assessment programs, many of which are or soon 
will be including performance assessment components, recently was 
addressed in an NCEO report (Ysseldyke & Thurlow, 1993a) titled 
Inclusion and Testing Accommodations for Students with Disabilities. Reschly 
(1993) explained that inclusion decisions historically have varied with 
specific outcome domains and the stakes of results. Generally, as the 
consequences of an assessment increase, there has been an unwarranted 
exclusion of students with disabilities. Reschly and several others (i.e., 
Algozzine, 1993; Reynolds, 1993) who authored position papers for the 
NCEO report argued for full or nearly full inclusion of students with 
disabilities. In contrast, Merwin (1993) took the position that it is accept- 
able to exclude children with disabilities because they represent a rela- 
tively small number of students. Merwin did acknowledge, however, 
that validity research on the performances of students with disabilities 
on large-scale assessments was still needed. He suggested that the 
inclusion issue is directly related to several other technical issues such 
as reliability, score aggregation, and sampling. 

Based on the existing research literature on performance assess- 
ments and awareness of general concerns about the assessment of stu- 
dents with disabilities, the following short list of issues is recommended 
for conceptual work and empirical investigations: 



• Fairness or Bias. Methods for preventing or reducing bias must be 
enumerated, and procedures for determining bias must be built into 
to the field testing and scoring of the various performance 
assessments. 

• Comparability of Tasks. Methods for documenting the nature of 
performance tasks must be developed so variables of task content, 
task difficulty, and task value can be operationalized and 
communicated. 



• Consequences of Performance Assessments. Procedures for 
documenting the intended and unintended consequences of 
performance assessments must be designed, and related data 
should be collected as part of the field-testing programs. In 
addition, data on costs for conducting large-scale performance 
assessments need to be gathered. 

Performance assessments hold significant promise for enhancing 
the instruction and evaluation of students, yet before they become a 
viable reality for use with all students, many technical questions require 
empirical and pragmatic answers. In addition, educators need to address 
the following implementation issues: 

1. Selection of educational outcomes to guide assessments. 

2. Identification of indicators of progress toward targeted outcomes. 

3. Development of methods for assessing performance on these indi- 
cators and/or outcomes. 



4. Development and refinement of criteria for scoring students' per- 
formances and standards for interpreting performances. 

5. Creation of teacher training opportunities to enhance under- 
standing and use of performance assessments. 

6. Stimulation of system support and resources needed to facilitate 
alternative assessments. 

Collectively, these implementation concerns represent significant 
pieces of the performance assessment puzzle (Elliott, 1993; see Figure 3). 
They require the attention of researchers and educators alike if perform- 
ance assessment practices are to be usable with a large number of 
students with disabilities. 



FIGURE 3 

Implementation Pieces of the Performance Assessment Puzzle 




7. Conclusion: Proceed 
with Caution 



Expectations for performance assessments seem extremely high, yet the 
conceptual and technical underpinnings of these assessments are just 
beginning to be understood. Traditional models of test development 
and validation are being stretched, instructional practices are being 
reconceptualized with assessment as a centerpiece, and statements 
about rigorous outcomes for all students are being made more loudly 
and frequently. Changes are occurring in many educational locales. 
Some of the suggested changes are theoretically and practically familiar 
to special educators, who have for decades been adapting assessments 
to provide more insights into instruction. For example, adoption of 
behavioral assessment models that honor individual variability as 
meaningful information (rather than as measurement error), and use of 
mastery learning approaches where what is taught is tested and re- 
tested, are common approaches shared by many special educators and 
current reformers of educational assessment practices. More common 
ground is needed, however, if students with disabilities are to share in 
the purported benefits of performance assessments. The first practical 
step is one of inclusion — inclusion in statewide assessment programs, 
in local educators' discussions of classroom assessment, and in the 
research programs of the technical experts studying the characteristics 
and qualities of performance assessments. 

Performance assessment is a promising philosophy and method 
of assessment that may have some significant practical benefits for all 
students and educators. Strong forms of performance assessments are 
achievable in the classroom, where teachers have control of instruc- 
tional outcomes and the instructional environment so that assessment 
criteria and feedback can be used to enhance learning. Apparently, all 
students can benefit from the classroom use of performance assessments, 
and many of the technical concerns are minimized at the classroom 
level, given the lower stakes associated with classroom-based 
decisions. With regard to the use of performance assessments in statewide 
assessment programs, where stakes are presumed to be high, more 
data are needed to temper dogma and ensure quality. 

The ERIC/OSEP Special Project 



The ERIC/OSEP Special Project at The Council for Exceptional Children 
facilitates communication among researchers sponsored by the Office of 
Special Education Programs (OSEP) in the U.S. Department of Educa- 
tion, and it disseminates information about special education research 
to audiences involved in the development and delivery of special education 
services. These audiences include 

• Teachers and related services professionals. 

• Teacher trainers. 

• Administrators. 

• Policymakers. 

• Researchers. 

The activities of the ERIC/OSEP Special Project include tracking 
current research, planning and coordinating research conferences, and 
developing a variety of publications that synthesize or summarize recent 
research on critical issues and topics. Each year, the Special Project hosts 
a conference attended by research project directors sponsored by OSEP. 
Throughout the year, it holds research forums and work groups to bring 
together experts on emerging topics of interest. Focus groups repre- 
senting the Special Project's audiences are held to inform both OSEP and 
the Special Project of audience information needs and to enhance the 
utility of publications produced by the Special Project. These publications 
include an annual directory of research projects as well as publications 
about current research efforts. 

The ERIC/OSEP Special Project is funded under a three-party 
contract between The Council for Exceptional Children, the Office of 
Special Education Programs, and the Office of Educational Research and 
Improvement, U.S. Department of Education. Under this contract, OSEP 
funds the ERIC/OSEP Special Project, and OERI funds the ERIC Clear- 
inghouse on Disabilities and Gifted Education. The ERIC Clearinghouse 
on Disabilities and Gifted Education is one of 16 clearinghouses of the 
Educational Resources Information Center (ERIC) system, which main- 
tains a database of over 440,000 journal annotations and 340,000 docu- 
ment abstracts concerning education. The ERIC Clearinghouse on 
Disabilities and Gifted Education gathers and disseminates information 
on all disabilities and on giftedness across age levels. 